
A Design space

Neural Information Processing Systems

Compute resources  We trained the configurations on a large SLURM-based cluster with approximately 300,000 CPU cores available in parallel.

Data splits  We split our performance dataset into training, validation, and test splits in an approximately 70-15-15 ratio. In step 1, we treated every single configuration's data points across multiple epochs as time-series data, where each epoch is a single time step, thereby grouping together

Adding bounds  Since XGBoost is an unbounded regression model, i.e. its codomain is

This allows for a comprehensive analysis of the optimizers' performance.

Dataset                 Average predicted runtime [CPU-d]
CIFAR-10                2.0
Colorectal-Histology    0.2
Fashion-MNIST           2.2

This does not take into account carbon emissions for optimizing and training the surrogate benchmarks on the data, nor indirect emissions such as manufacturing the compute hardware and maintaining the compute cluster. We noted that our surrogate models' performance on the Colorectal-Histology task was much worse. In the first experiment, 20 configurations were randomly chosen from the set of configurations belonging to the highest-fidelity group (N=5, W=16, R=1.0) that had already been evaluated on CIFAR-10 and Colorectal-Histology for our performance dataset, and were re-evaluated for 200 epochs using 2 different, randomly sampled sets of seeds for initialization. We present the results of this analysis in Table 11.
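The "Adding bounds" step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the surrogate predicts a bounded metric such as validation accuracy in [0, 1], and stands in for the XGBoost model's raw outputs with a plain array.

```python
import numpy as np

def bounded_predict(raw_predictions, low=0.0, high=1.0):
    """Clip an unbounded regressor's outputs to a valid metric range.

    XGBoost regressors have codomain (-inf, inf); when the target is a
    bounded metric (here assumed to be accuracy in [0, 1]), out-of-range
    predictions can simply be clipped to the valid interval.
    """
    return np.clip(np.asarray(raw_predictions, dtype=float), low, high)

# hypothetical raw surrogate outputs, two of them out of range
raw = [-0.03, 0.42, 0.87, 1.10]
print(bounded_predict(raw))  # → [0.   0.42 0.87 1.  ]
```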



Meta builds world's largest AI superclusters for the future

FOX News

The CyberGuy Kurt Knutsson joins 'Fox & Friends' to discuss the U.S.-Saudi investment summit and the debate over regulation as artificial intelligence continues to advance. What happens when one of the world's richest companies decides to go all-in on artificial intelligence? If you're Meta Platforms CEO Mark Zuckerberg, it means launching superclusters so large they could rival the footprint of Manhattan. Recently, Zuckerberg unveiled plans to invest "hundreds of billions of dollars" into next-generation AI infrastructure, including some of the largest compute clusters the world has ever seen. Meta's first supercluster, called Prometheus, is slated to go live in 2026.


VEXP: A Low-Cost RISC-V ISA Extension for Accelerated Softmax Computation in Transformers

Wang, Run, Islamoglu, Gamze, Belano, Andrea, Potocnik, Viviane, Conti, Francesco, Garofalo, Angelo, Benini, Luca

arXiv.org Artificial Intelligence

While Transformers are dominated by Floating-Point (FP) matrix multiplications, their aggressive acceleration through dedicated hardware or many-core programmable systems has shifted the performance bottleneck to non-linear functions like Softmax. Accelerating Softmax is challenging due to its non-pointwise, non-linear nature, with exponentiation as the most demanding step. To address this, we design a custom arithmetic block for Bfloat16 exponentiation leveraging a novel approximation algorithm based on Schraudolph's method, and we integrate it into the Floating-Point Unit (FPU) of the RISC-V cores [1] of a compute cluster, through custom Instruction Set Architecture (ISA) extensions, with a negligible area overhead of 1%. By optimizing the software kernels to leverage the extension, we execute Softmax with 162.7x lower latency and 74.3x lower energy compared to the baseline cluster, achieving an 8.2x performance improvement and 4.1x higher energy efficiency for the FlashAttention-2 kernel in the GPT-2 configuration. Moreover, the proposed approach enables a multi-cluster system to efficiently execute end-to-end inference of pre-trained Transformer models, such as GPT-2, GPT-3, and ViT, achieving up to 5.8x and 3.6x reductions in latency and energy consumption, respectively, without requiring re-training and with negligible accuracy loss. Transformer-based models such as the GPT family [2] and the LLaMa family [3] have emerged as a cornerstone of machine learning, demonstrating state-of-the-art performance in diverse domains, including natural language processing (NLP), computer vision, and audio processing. At the core of their success is the Transformer architecture [4], which utilizes the self-attention mechanism to model complex relationships within input sequences.
In encoders and the prefill stage of decoders, the computational complexity of attention layers scales quadratically with the input sequence length, leading to memory and computational overheads that necessitate mitigation by means of dedicated acceleration. This work was supported by the NeuroSoC project, funded under the European Union's Horizon Europe research and innovation programme (Grant Agreement No. 101070634). [Figure caption: for each sequence length, the left bar shows unoptimized GEMM results, while the right bar reflects optimized GEMM results.]
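Schraudolph's method, which the paper's exponentiation block builds on, approximates exp(y) by writing a scaled integer directly into the exponent field of an IEEE-754 float, so no multiply-heavy polynomial is needed. Below is a minimal Python sketch of the original double-precision trick; the constants are Schraudolph's, not the authors', and the paper's Bfloat16 hardware variant refines this idea rather than implementing it verbatim.

```python
import struct

def schraudolph_exp(y: float) -> float:
    """Approximate exp(y) via Schraudolph (1999).

    Writes i = (2^20 / ln 2) * y + b into the upper 32 bits of an IEEE-754
    double. The exponent field then encodes floor(y / ln 2), and the
    mantissa bits linearly interpolate between adjacent powers of two,
    giving a worst-case relative error of roughly 4%.
    """
    a = 1048576 / 0.6931471805599453   # 2^20 / ln(2)
    b = 1072693248 - 60801             # exponent bias << 20, minus an error-correction term
    i = int(a * y + b)
    # lower 32 bits stay zero; only the upper word carries the estimate
    return struct.unpack('<d', struct.pack('<Ii', 0, i))[0]

print(schraudolph_exp(1.0))   # ~2.77 vs. exp(1) = 2.71828 (about 2% error)
```

In a Softmax kernel this replaces every call to a libm-style `exp`, which is why the single custom instruction pays off across the whole attention layer.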


Why Europe's Efforts to Gain AI Autonomy Might Be Too Little Too Late

TIME - Tech

This week Microsoft announced that it would invest €3.2 billion ($3.5 billion) in Germany over the next two years. The U.S. tech giant will use the money to double the capacity of its artificial intelligence and data center infrastructure in Germany and expand its training programmes, according to Microsoft vice chair and president Brad Smith. The move follows a similar announcement from November 2023, when Microsoft said it would invest £2.5 billion ($3.2 billion) in infrastructure in the U.K. over the next three years. Both countries hailed the investments as significant steps that would permit them to compete on the world stage when it comes to AI. However, the investments are dwarfed by investments made by U.S.-based cloud service providers elsewhere, particularly in the U.S. As AI becomes increasingly economically and militarily important, governments are taking steps to ensure they have control over the technology that they depend on.


Compute at Scale: A Broad Investigation into the Data Center Industry

Pilz, Konstantin, Heim, Lennart

arXiv.org Artificial Intelligence

This report characterizes the data center industry and its importance for AI development. Data centers are industrial facilities that efficiently provide compute at scale and thus constitute the engine rooms of today's digital economy. As large-scale AI training and inference become increasingly computationally expensive, they are predominantly executed on this dedicated infrastructure. Key features of data centers include large-scale compute clusters that require extensive cooling and consume large amounts of power, the need for fast connectivity both within the data center and to the internet, and an emphasis on security and reliability. The global industry is valued at approximately $250B and is expected to double over the next seven years. There are likely about 500 large (above 10 MW) data centers globally, with the US, Europe, and China constituting the most important markets. The report further covers important actors, business models, main inputs, and typical locations of data centers.


Q-EEGNet: an Energy-Efficient 8-bit Quantized Parallel EEGNet Implementation for Edge Motor-Imagery Brain-Machine Interfaces

Schneider, Tibor, Wang, Xiaying, Hersche, Michael, Cavigelli, Lukas, Benini, Luca

arXiv.org Artificial Intelligence

Motor-Imagery Brain-Machine Interfaces (MI-BMIs) promise direct and accessible communication between human brains and machines by analyzing brain activities recorded with Electroencephalography (EEG). Latency, reliability, and privacy constraints make it unsuitable to offload the computation to the cloud. Practical use cases demand a wearable, battery-operated device with low average power consumption for long-term use. Recently, sophisticated algorithms, in particular deep learning models, have emerged for classifying EEG signals. While reaching outstanding accuracy, these models often exceed the limitations of edge devices due to their memory and computational requirements. In this paper, we demonstrate algorithmic and implementation optimizations for EEGNET, a compact Convolutional Neural Network (CNN) suitable for many BMI paradigms. We quantize weights and activations to 8-bit fixed-point with a negligible accuracy loss of 0.4% on 4-class MI, and present an energy-efficient hardware-aware implementation on the Mr.Wolf parallel ultra-low power (PULP) System-on-Chip (SoC) by utilizing its custom RISC-V ISA extensions and 8-core compute cluster. With our proposed optimization steps, we can obtain an overall speedup of 64x and a reduction of up to 85% in memory footprint with respect to a single-core layer-wise baseline implementation. Our implementation takes only 5.82 ms and consumes 0.627 mJ per inference. With 21.0 GMAC/s/W, it is 256x more energy-efficient than an EEGNET implementation on an ARM Cortex-M7 (0.082 GMAC/s/W).


Stability AI builds foundation models on Amazon SageMaker

#artificialintelligence

We're thrilled to announce that Stability AI has selected AWS as its preferred cloud provider to power its state-of-the-art AI models for image, language, audio, video, and 3D content generation. Stability AI is a community-driven, open-source artificial intelligence (AI) company developing breakthrough technologies. With Amazon SageMaker, Stability AI will build AI models on compute clusters with thousands of GPUs or AWS Trainium chips, reducing training time and cost by 58%. Stability AI will also collaborate with AWS to enable students, researchers, startups, and enterprises around the world to use its open-source tools and models. "Our mission at Stability AI is to build the foundation to activate humanity's potential through AI. AWS has been an integral partner in scaling our open-source foundation models across modalities, and we are delighted to bring these to SageMaker to enable tens of thousands of developers and millions of users to take advantage of them. We look forward to seeing the amazing things built on these models and helping our customers customize and scale their models and solutions."